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Abstract 

Approximate Bayesian Computation (ABC) can be viewed as an analytic approximation of an 
intractable likelihood coupled with an elementary simulation step. Such a view, combined with 
a suitable instrumental prior distribution permits maximum-likelihood (or maximum-a-posteriori) 
inference to be conducted, approximately, using essentially the same techniques. An elementary 
approach to this problem which simply obtains a nonparametric approximation of the likelihood 
surface which is then used as a smooth proxy for the likelihood in a subsequent maximisation step 
is developed here and the convergence of this class of algorithms is characterised theoretically. The 
use of non-sufficient summary statistics in this context is considered. Applying the proposed method 
to four problems demonstrates good performance. The proposed approach provides an alternative 
for approximating the maximum likelihood estimator (MLE) in complex scenarios. 

Keywords: Approximate Bayesian Computation; Density Estimation; Maximum Likelihood Esti- 
mation; Monte Carlo Methods. 



1 Introduction 

Modern applied statistics must deal with many settings in which the pointwise evaluation of the like- 
lihood function, even up to a normalising constant, is impossible or computationally infeasible. Areas 
such as financial modelling, genetics, geost atistics, neurophysiology and stochastic dyn ami cal systems 



provi de numerous examples of this (see e.g. 



Cox and Smith. 



1954; 



Pritchard et al. 



1999 



; and 



Toni et al. 



2009). It is consequently difficult to perform any inference (classical or Bayesian) about the parame- 



ters of the model. Various appr oaches to overcome t his difficulty have been proposed. For instance, 



Composite Likelihood methods (iCox and Reid , 



2004) . for approximating the likelihood function, an d 



Pritchard et al 



1999; 



Beaumont et al. 



2002), 



Approximate Bayesian Computational methods (ABC; 
for approximating the posterior distribution, have been extensively studied in the statistical literature. 
Here, we study the use of ABC methods, under an appropriate choice of instrumental prior distribution, 
to approximate the maximum likelihood estimator. 
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It is well-known that ABC produces a sample approximation of the posterior distribution (IBeaumont et al. 



2002) in which there exists a deterministic approximation error in addition to Monte Carlo variabil- 



ity. The quality of the approximation t o the poste r ior an d theoretical 



tained with ABC have been studied in 



was studied in 



Didelot et al 



Wilkinsonl (120081) : 



p ropert ie s of the est i mator s ob- 



Marin et al 



d2on|); 



Dean et al 



(120111) and 



Fearnhead and Prangle (12012). Th e us e of ABC posterior 



(1201 lh and 



Robert et al. 



(2011 



samples for conducting model comparison 
). Using this sample approximation to char- 



acterise the mode of the posterior would in principle allow (approximate) maximum a posteriori (MAP) 
estimation. Furthermore, using a uniform prior distribution, under the parameterisation of interest, over 
any set which contains the MLE will lead to a MAP estimate which coincides with the MLE. In low- 
dimensional problems if we have a sample from the posterior distribution of the parameters, we can 
estimate its mode by using either nonpar ametric estimators of the density or another mode-seeking tech- 



nique such as the mean-shift algorithm (IFukunaga and Hostetler 



19751) . Therefore, in contexts where 



the likelihood function is intractable we can use these results to obtain an approximation of the MLE. 
We will deno te the estimator obtaine d with this approximation AMLE. 



Although Marjoram et al. 



(2003) noted that "It [ABC] can also be used in frequentist applications, 
in particular for maximum-likelihood estimation" this idea does not seem to have been developed. A 
method b ased around maximisation o f a non-parametric estimate of the log likelihood function was pro- 



posed by 



Diggle and Gratton 



(11984T) in the particular case of simple random samples; their approach 



involved samplin g numerous replica tes of the data for each parameter value and estimating the density 



in the data space. Ide Valpind (120041) proposes an importance sampling technique, rather closer in spirit 
to the approach developed here, by which a smoothed kernel estimation of the likelihood function up 
to a proportionality constant can be obtained in the particular case of state space models provided that 
techniques for sampling from the joint distribution of unknown parameters and latent states are avail- 
able — not a requirement of the more general ABC technique developed below. The same idea was 
applied and analysed in the context of the estimat ion of location parameters, with particular emphasis on 



symmetric distributions, by 



Markov models was also investigated by 



Jaki and West! (120081). The partic ular case of parameter estimation in hidden 



Dean et al. 



(12011 



), who relied upon the specific structure of 



that problem. To the best of our knowledge neither MAP estimation nor maximum likelihood estimation 
in general, implemented directly via the "ABC approximation" combined with maximisation of an esti- 

ot of interest in this type 



mated density, have been studied in the literature. However, there has been a 



of problem using different approaches (ICox and KartsonakiL 



201: 



Fan et al. 



2012; 



Mengersen et al. 



2 



2012D since we completed the first version of this work (IRubio and Johansenl . 



2012h . 



The estimation of the mode of nonparametric kernel densit y estimators which may seem, at first, to 



be a hopeless task has also received a lot of atten t ion (s ee e.g. 



1988 



Abraham et al. 



2003 



Bickel and Friiwirth . 



Parzen 



19621 : 



Konakov . 



19731 : 



Romano, 



20061) . Alternative nonparametric density estimators 



which could also be con sidered within the AMLE context have been proposed recently in 



Cule et al. 



d2010b; 



Jing et al 



20121) . 

The remainder of this paper is organised as follows. In Section |2j we present a brief description 
of ABC methods. In Section [3] we describe how to use these methods to approximate the MLE and 
present theoretical results to justify such use of ABC methods. In Section |4j we present simulated 
and real examples to illustrate the use of the proposed MLE approximation. Section [5] concludes with 
a discussion of both the developed techniques and the likelihood approximation obtained via ABC in 
general. 



2 Approximate Bayesian Computation 

We assume throughout this and the following section that all distributions of interest admit densities 
with respect to an appropriate version of Lebesgue measure, wherever this is possible, although this 
assumption can easily be relaxed. Let x = (x\, ...,2%) £ M. gxn be a simple random sample from a 
distribution f(-\6) with support contained in M. q , 6 G C M, d ; C(9; x) be the corresponding likelihood 
function, ir(0) be a prior distribution over the parameter 6 and tt(0|x) the corresponding posterior 
distribution. Consider the following approximation to the posterior 

». (8W _ fol'M«> , (!) 

/ e /«(x|tWt)dt 

where 

/ £ (x|0) = I if £ (x|y)/(y|0)dy, (2) 

is an approximation of the likelihood function and K £ (x\y) is a normalised Markov kernel. K e (-\y) 
is typically concentrated around y with e acting as a scale parameter. It's clear that (2) is a smoothed 
version of the true likelihood and it has been argued that the maximisation of such an approxim ation can 



in some circumstances lead to better performance than the maximisation of the likelihood itself dlonidesl . 
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20051) , providing an additional motivation for the investigation of MLE via this approximation. The 
approximation can be further motivated by noting that under weak regularity conditions, the distribution 
7r e (0|x) is close (in some sense) to the true posterior 7r(0|x) when e is sufficiently small. The simplest 
approach to ABC samples directly from ([1]) by the rejection sampling approach presented in Algorithm 
□ 

Algorithm 1 The basic ABC algorithm. 



Simulate 6' from the prior distribution ir(-). 
Generate y from the model f{-\6'). 

Accept 0' with probability oc i^ e (x|y) otherwise return to step 1. 



Now, let r] : W 1 ^ ->• R m be a summary statistic, p:f m xK m -f R + be a metric and e > 0. The 
simplest ABC algorithm can be formulated in this way using the kernel 



K E (x\y) oc < 



1 if p(r/(x),r ? (y)) < e, 
otherwise. 



(3) 



The ABC rejection algorithm of 



Pritchard et al. 



(1999) can be obtained simply by setting ?y(x) = x. 



Sev eral improvemen t s to th e ABC method h a ye bee n pr oposed in orde r to increase the acceptance rate, 



see 



Beaumont et al. 



(2002), 



Marjoram et al. 



d2003l) and 



Sissonl (120071) for good surveys of these. An 



exhaustive summary of these developments falls outside the scope of the present paper. 



3 Maximising Intractable Likelihoods 



3.1 Algorithm 



Point estimation of 6, by MLE and MAP estimation in particular, has been extensively studied (ILehmann and Casella 



19981) . Recall that the MLE, 0, and the MAP estimator are the values of 6 which maximise the likeli- 
hood or posterior density for the realised data. 

These two quantities coincide when the prior distribution is constant (e.g. a uniform prior 7r(#) 
on a suitable (necessarily compact) set D which contains 6). Therefore, if we use a suitable uniform 
prior, it is possible to approximate the MLE by using ABC methods to generate an approximate sample 
from the posterior and then approximating the MAP using this sample. In a different context in which 
the likelihood can be evaluated pointwise, simulation-based MLEs which use a similar construction 
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have been shown to perform well (see, e.g. JGaetan and Yaol 120031 . ILele et all 120071 and lJohansen et al. 



2008). In the present setting the optimisation step can be implemented by estimating the posterior 
density of 6 using a nonparametric estimator (e.g. a kernel density estimator) and then maximising this 
function: Algorithm [2] 

We note that we have not here considered similar simulation-based approaches to the direct optimi- 
sation of the likelihood function for a number of reasons. One is computational cost (not having access 
to the likelihood even pointwise means that distributions concentrated around the mode could be con- 
structed only by introducing several replicates of the data and the rejection or other mechanism used to 
produce samples from this distribution will become increasingly inefficient as the number of replicates 
increases); another is that the proposed method has the additional advantages that it fully characterises 
the likelihood surface and can be conducted concurrently with Bayesian analysis with no additional 
simulation effort. 

Algorithm 2 The AMLE Algorithm 

1: Obtain a sample 6* n £ = (0J^ CjlJ 0* m £ m ) from tt £ (0|x). 

2: Using the sample 0* m £ construct a nonparametric estimator tp of the density 7? e (0|x). 
3: Calculate the maximum of tp, 6 m ^ e . This is an approximation of the MLE 0. 



Note that the first step of this algorithm can be implemented rather generally by using essentially 
any algorithm which can be used in the standard ABC context. It is not necessary to obtain an iid sample 
from the distribution tt £ : provided the sample is appropriate for approximating that distribution it can 
in principle be employed in the AMLE context (although correlation between samples obtained using 
MCMC techniques and importance weights and dependence arisi ng from the use of S MC can complicate 



2003)). 



density estimation, it is not as problematic as might be expected dSkold et al. 

A still more general algorithm could be implemented: using any prior which has mass in some 
neighbourhood of the MLE and maximising the product of the estimated likelihood and the reciprocal of 
this prior (assuming that the likelihood estimate has lighter tails than the prior, not an onerous condition 
when density estimation is used to obtain that es timate) will also p rovide an estimate of the likelihood 



maximiser, an approach which was exploited by 



de Valpine (2004) (who provided also an analysis of 



the smoothing bias produced by this technique in their context). In the interests of parsimony we do not 
pursue this approach here, and throughout the remainder of this document we assume that a uniform prior 
over some set D which includes the MLE is used, although we note that such an extension eliminates 
the requirement that a compact set containing a maximiser of the likelihood be identified in advance. 
One obvious concern is that the approach could not be expected to work well when the parame- 



ter space is of high dimension: it is well known that density estimators in high-dimensional settings 
converge very slowly. Three things mitigate this problem in the present context: 

• Many of the applications of ABC have been to problems with extremely complex likelihoods 
which have only a small number of parameters (such as the examples considered below). 

• When the parameter space is of high dimension one could employ composite likelihood techniques 
with low-dimensional components estimated via AMLE. Provided appropriate parameter subsets 
are selected, the loss of efficiency will not b e too severe in many cases. Alt ernatively, a different 



mode-seeking algorithm could be employed (IFukunaga and Hostetlerl . 



1975). 



• In certain contexts, as discussed below Proposition [2] it may not be necessary to employ the 
density estimation step at all. 

Finally, we note that direct maximisa tion of the sm oothed likelihood approximation © can be inter- 



preted as a pseudo-likelihood technique (IBesagL 



19751) . with the Monte Carlo component of the AMLE 



algorithm providing an approximation to this pseudo-likelihood. 
3.2 Asymptotic Behaviour 

In this section we provide some theoretical results which justify the approach presented in Section 13.11 
under similar conditions to those used to motivate the standard ABC approach. We assume throughout 
that the MLE exists in the model under consideration but that the likelihood is intractable; in the case of 
non-compact parameter spaces, for example, this may require verification on a case-by-case basis. 

We begin by showing pointwise convergence of the posterior (and hence likelihood) approximation 
under reasonable regularity conditions. It is convenient first to introduce the following concentration 
condition on the class of ABC kernels which are employed: 

Condition K A family of symmetric Markov kernels with densities K £ indexed by e > is said to 
satisfy the concentration condition provided that its members become increasingly concentrated 
as e decreases such that 

/ K e (x|y)dy= / tf e (y|x)dy = 1, Ve > 0. 
jb e (x) ie E (x) 

where £> e (x) := {z : jz — x| < e}. 
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As the user can freely specify K this is not a problematic condition. It serves only to control the 
degree of smoothing which the ABC approximation of precision e can effect. 

Proposition 1. Let x = (xi, ...,x n ) € W }Xn be a sample from a continuous distribution f{-\6) with 
support contained in W 1 , 6 6 O C M. d ; DcR^ffl compact set that contains 6, the MLE of 6; and let 
K £ be the densities of a family of symmetric Markov kernels, which satisfies the concentration condition 
(K). 

Suppose that 



sup /(t|0) < oo, 

(t,6»)GB E (x)xD 



for some e > 0. Then, for each 9 € D 



lim 7r e (#|x) = 7r (6\x) . 
Proof. It follows from the concentration condition that: 

/ E (x|0) = / K e (x.\y)f(y\0)dy. 

Furthermore, for each 6 6 D 

|/ £ (x|0) - /(x|0)| < / dyK £ (x\y) \f(y\9) - f(x\0)\ < sup \f(y\9) - /(x|fl)| (4) 

which converges to as e — > by continuity. Therefore / E (x|0) £ ~ > °) /(x|0). 

Now, by bounded convergence (noting that boundedness of / e (x|0), for e < e, follows from that of 
/ itself), we have that: 

lim [ fJx\G')Tr(0')dO' = [ f(x\0')Tr(9')d9' . (5) 
Jd Jd 

The result follows by combining © and d5), whenever 7r(0|x) is itself well defined. □ 

This result can be strengthened by noting that it is straightforward to obtain bounds on the error 
introduced at finite e if we assume Lipschitz continuity of the true likelihood. Unfortunately, such con- 
ditions are not typically verifiable in problems of interest. The following result, in which we show that 
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whenever a sufficient statistic is employed the ABC approximation converges pointwise to the posterior 
distribution, follows as a simple corollary to the previous proposition. However, we provide an explicit 
proof based on a slightly different argument in order to emphasize the role of sufficiency. 

Corollary 1. Let x = (xi, x n ) € R qxn be a sample from a distribution f(-\0) over M. q , rj : M. n ' q — > 
R m be a sufficient statistic for G C M d , p : M. m x R m — > R + be a metric and suppose that the 
density ofr], f v (-\0), is p— continuous for every 9gD. Let D C M. d be a compact set, suppose that 

sup P{t\9) < oo, 
(t,0)eZ3 £ xD 

where B t = {t € W 71 : p(r/(x),t) < e}/or some e > fixed. Then, for each € D awe? me kernel ([3]) 



lim7r e (0|x) = tt(0|x) 

e— »0 



Proof. Using the integral Mean Value Theorem (as used in a similar context by Dean et al.l (1201 ll . Equa- 
tion 6)) we find that for € D and any e G (0, e): 

/ £ (x|0)cx J I(p(T,<y)M*))<e)f<y\0)<ly = J B f'(T/\0)drf = \(B e )F(S(0,x,e)\e), 

for some £(0, x, s) € S E , where A is the Lebesgue measure. Then 

/" (£(0,x, £ )|0)tt(0) 



vr £ (0|x) 



^PW>,X,E)\0l)*(0l)d0'- 

As this holds for any sufficiently small e > 0, we have by p— continuity of f v (-\0): 

lim/*(e(0 |X ,e)|0) = rfa(x)|0). (6) 

Using the Dominated Convergence Theorem we have 

lim / P(t(0',-x,e)\0')ir(0')d0'= [ P ( V (x)\0') Tt(0')d0' . (7) 

By the Fisher-Neyman factorization Theorem we have that there exists a function h : M n ' 9 — >■ R + such 
that 



/(x|0) = Mx)f fa(x)|0) 
8 



(8) 



The result follows by combining ([7J) and (JSJ.D 

With only a slight strengthening of the conditions, Proposition Q] allows us to show convergence of 
the mode as e — > to that of the true likelihood. It is known that pointwise convergence t ogether with 



1976; 



Whitney, 



199 lh . Therefore, 



equicontinuity on a compact set implies uniform convergence (IRudiri . 
if in addition to the conditions in Proposition Q] we assume equicontinuity of 7? e (-|x) on D, a rather 
weak additional condition, then the convergence to 7r(-|x) is uniform and we have the following direct 
corollary to Proposition [T| 

Corollary 2. Let 7? e (-|x) achieve its global maximum at 6 £ for each e > and suppose that vr(-|x) has 
unique maximiser 0. Under the conditions in Proposition^ if7r e (-\x.) is equicontinuous, then 



lim 7r e (0 e |x) = 7r(0|x). 



Using these results we can show that for a simple random sample 6* n £ = (0^ E l5 Q* m e m ) from 
the distribution 7? e (-|x) with mode at 6 £ and an estimator m)£ , based on 6* n £ , of 6 £ , such that G mj£ — > 
6 £ almost surely when m — > oo, we have that for any 7 > there exists e > such that 



lim 7r e (0 m Jx) - 7r [6\x] < 7, a.s. 



That is, in the case of a sufficiently well-behaved density estimation procedure, using the simple 
form of the ABC estimator (Algorithm [T} we have that for any level of precision 7, the maximum of the 
AMLE approximation will, for large enough ABC samples, almost surely be 7— close to the maximum 
of the posterior distribution of interest, which coincides with the MLE under the given conditions. A 
simple continuity argument suffices to justify the use of m>£ to approximate 6 for large m and small e. 

The convergence shown in the above results depends on the use of a sufficient statistic. In contexts 
where the likelihood is intractable, this may not be available. In the ABC literature, it has become 
common to employ summary statistics which are not sufficient in this setting. Although it is possible to 
characterise the likelihood approximation in this setting, it is difficult to draw useful conclusions from 
such a charac terisation. The con str uction of appropriate summary statistics remains an active research 



area (see e.g. 



Peters et al. 



2010 and 



Fearnhead and Prangle 



2012h . 



We finally provide one result which provides some support for the use of certain non-sufficient 
statistics when there is a sufficient quantity of data available. In particular we appeal to the large-sample 
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limit in which it can be seen that for a class of summary statistics the AMLE can almost surely be made 
arbitrarily close to the true parameter value if a sufficiently small value of e can be used. This is, of 
course, an idealisation, but provides some guidance on the properties required for summary statistics 
to be suitable for this purpose and it provides some reassurance that the use of such statistics can in 
principle lead to good estimation performance. In this result we assume that the AMLE algorithm is 
applied with the summaiy statistics filling the role of the data and hence the ABC kernel is denned 
directly on the space of the summary statistics. 

In order to establish this result, we require that, allowing ?7 n (x) = %(a?i, ... ,x n ) to denote a 
sequence of d v -dimensional summary statistics, the following four conditions hold: 

S.i lim^oo T} n (x) = g{9) for ir - a.e. 9 

S.ii g : — > M. d i is an injective mapping. Letting H = g(D) C R. dri denote the image of the feasible 
parameter space under g, g" 1 : H — > is an a-Lipschitz continuous function for some a £ M.+. 

S.iii The ABC kernels, defined in the space of the summary statistics, satisfy condition K, i.e. Ke(-\rf) 
it is concentrated within a ball of radius e for all e: supp K e (-\r)') C B e (rf) 

S.iv The nonparametric estimator used always provides an estimate of the mode which lies within the 
convex hull of the sample. 

Some interpretation of these conditions seems appropriate. The first tells us simply that the summary 
statistics converge to some function of the parameters in the large sample limit, a mild requirement 
which is clearly necessary to allow recovery of the parameters from the statistics. The second condition 
strengthens this slightly, requiring that the limiting values of the statistics and parameters exist in one- 
to-one correspondence and that this correspondence is regular in a Lipschitz-sense. The remaining 
conditions simply characterise the behaviour of the ABC approximation and the AMLE algorithm. 

Proposition 2. Let x = (xi,X2, . . .) denote a sample with joint measure p,(-\9) for some 9 € D C 
0. Let it (9) denote a prior density over D. Let rj n (x) = rj n (xi, . . . ,x n ) denote a sequence of d^- 
dimensional summary statistics with distributions pP n {-\9). Allow rj* to denote an observed value of the 
sequence of statistics obtained from the model with 9 = 9*. 

Assume that conditions S.i— S.iv hold. Then: 

(a) supp linij^oo 7r e (9\r]*) C B a£ (9*) for ^{■\9*)-almost every rf for n-almost every 9*. 
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(b) The AMLE approximation of the MLE lies within B ae {9*) almost surely. 

Proof. Allowing fe n (rj\6) to denote the ABC approximation of the density of rj n given 6, we have: 
lim f r >Hv\0)= lim / ^(d V '\0)K £ ( V W) a = s [ 5 g{0) (d n ')KMv') = 

n— >oo n— yoo J J 

where with the final equality following from S.i (noting that almost sure convergence of r\ n to g(6) 
implies convergence in distribution of rj n to a degenerate random variable taking the value g{6)). 
From which it is clear that supp linin^oo fe n (-\0) C B e (g{6)) by S.iii. 

And the ABC approximation to the posterior density of 6, lim n _ >oc ir £ (-\r] n ), may be similarly con- 
strained: 

lim ir £ {e\r] n ) > lim \\rj n - g(0)\ \ "<£ lim Wg^ivn) - 0\ \ °<a£ 

n— >oo n— >oo n— >oo 

using S.ii. And by assumption S.i, S.ii and the continuous mapping theorem we have that g~ l (r/* ) ^1 0* 
giving result (a); result (b) follows immediately from S.iv. □ 

It is noteworthy that this proposition suggests that, at least in the large sample limit, one can use 
any estimate of the mode which lies within the convex hull of the sampled parameter values. The 
posterior mean would satisfy this requirement and thus for large enough data sets it is not necessary 
to employ the nonparametric density estimator at all in order to implement AMLE. This is perhaps an 
unsurprising result and seems a natural consequence of the usual Bayesian consistency results but it 
does have implications for implementation of AMLE in settings with large amounts of data for which 
the summary statistics are with high probability close to their limiting values. 

Example 1 (Location-Scale Families and Empirical Quantiles). Consider a simple random sample from 
a location-scale family, in which we can write the distribution functions in the form: 

F(xi\n,a) = F ((xi - n)/o) 

Allow = F~ 1 (qi) and rfc = i* 1-1 ^) to denote to empirical quantiles. By the Glivenko-Cantelli 
theorem, these empirical quantiles converge almost-surely to the true quantiles: 



lim 




a.s 


( 


F~ 














V "» J 






F~ 
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In the case of the location-scale family, we have that: 



F- 1 (q^,a)=aF -\q*) + v 



and we can find explicitly the mapping g : 

I 



9 1 (vi,Vn' 



\ 







1 a.s 




r 





provided that F 1 (qi) ^ F Q 1 (q2) which can be assured if Fq is strictly increasing and q± ^ q2- In this 
case we even obtain an explicit form for a. 



3.3 Use of kernel density estimators 

In this section we demonstrate that the simple Parzen estimator can be employed within the AMLE 
context with the support of the results of the previous section. 



Definition 1. nParzen , 



196a) Consider the problem of estimating a density with support on W 1 from m 



independent random vectors (Zi 



Let K be a kernel, h m be a bandwidth such that h r 







when m — > oo, then a kernel density estimator is defined by 



3=1 

Under the conditions h m — > and mh 7 ^/ log(m) —> oo together with Theorem 1 from 



Abraham et al 



(120031) . we have that m ^ 6 as m — > oo. Therefore, the results presented in the previous section ap- 
ply to the use of kernel density estimation. This demonstrates that this simple non-parametric estimator 
is adequate for approximation of the MLE via the AMLE strategy, at least asymptotically. 

This is, of course, just one of many ways in which the density could be estimated and more sophis- 
ticated techniques could very easily be employed and justified in the AMLE context. 

4 Examples 

We present four examples in increasing order of complexity. The first two examples illustrate the per- 
formance of the algorithm in simple scenarios in which the solution is known; the third compares the 
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algorithm with a quantile-based method in a setting which has recently been studied using ABC and 
the final example demonstrates performance on a challenging estimation problem which has recently 
attracted some attention in the literature. In all the examples the simple ABC rejection algorithm was 
used, together with ABC kernel ([3]). For the second, third and fourth examples, kernel density estimation 
is conducted using the R co mmand 'kde' together with t he bandwidth matrix obtained via the smoothed 



cross validat i on app roach of 



ks' (IDuongl . 



Duong and Hazeltonl (12005b using the command 'Hscv' from the R package 



20111) . R source code for these examples is available from the first author upon request. 



4.1 Binomial Model 

Consider a sample of size 30 simulated from a Binomial(10, 0.5) with x = 5.53. Using the prior 
9 ~ Unif (0, 1), a tolerance e = 0.1, a sufficient statistic ry(x) = x and the Euclidean metric we simulate 
an ABC sample of size 10, 000 which, together with Gaussian kernel estimation of the posterior, gives 
the AMLE = 0.552. 

There are three quantities affecting the precision in the estimation of 6: D, n and e. Figured] illus- 
trates the effect of varying n G {30, 100, 1000, 2000, 10000} for a fixed s, two different choices of 
D and an ABC sample of size 10, 000. Boxplots were obtained using 100 replications of the (stochas- 
tic) AMLE algorithm. This demonstrates that although, unsurprising the acceptance rate and hence 
computational efficienct is improved when some D which is relatively concentrated around the MLE is 
available, estimation precision remains good when the full support of the parameter space is included in 
D albeit at greater computational cost (the choice D = (0.45, 0.65) produces an acceptance rate about 
5 times greater than the choice D = (0, 1)). Figure [2] shows the effect of e € {1, 0.9, 0.1, 0.05, 0.01} 
for a fixed n and two different choices of D. In this case we can note that the effect of e on the precision 
is significant. Again, the choice of D affects only the acceptance rate. 
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O -j- 



(a) D = (0.45,0.65) 



(b) D = (0, 1) 



Figure 1: Effect of n € {30, 100, 1000, 2000, 10000} for e = 0.05. The continuous red line repre- 
sents the true MLE value. 



(a) 



(b) 



Figure 2: Effect of e e {1, 0.9, 0.1, 0.05, 0.01} for n 
continuous red line represents the true MLE value 



10000: (a) D = (0.45,0.65); (b) D = (0, 1). The 
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(a) (b) 
Figure 3: Monte Carlo variability of the AMLE: (a) fi; (b) a. The dashed lines represent the true MLE value. 

4.2 Normal Model 

Consider a sample of size 100 simulated from a Normal(0, 1) with sample mean x = —0.005 and sample 
variance s 2 = 1.004. Suppose that both parameters (/x, a) are unknown. The MLE of (/i, a) is simply 
(£,5) = (-0.005,1.002). 

Consider the priors fi ~ Unif(— 0.25, 0.25) and a ~ Unif (0.75, 1.25) (crude estimates of location 
and scale can often be obtained from data, justifying such a choice; using broader prior support here 
increases computational cost but does not prevent good estimation), a tolerance e = 0.01, a sufficient 
statistic ry(x) = (x, s), the Euclidean metric, an ABC sample of size 5, 000, and Gaussian kernel esti- 
mation of the posterior. Figure [3] illustrates Monte Carlo variability of the AMLE of (/x, a). Boxplots 
were obtained using 50 replicates of the algorithm. 



4.3 Financial application 

Logarithmic daily return prices are typically modelled using Levy processes. For this reason, it is neces- 
sary to model the increments (logarithmic returns) using an infinitely divisible distribution. It has been 
found empirically that these observations have tails heavier than those of the normal distribution, and 
therefore an attractive option is the use of the 4— parameter (a, /3, /i, a) a— stable family of distributions, 
which can account for this behaviour. It is well known that maximum likelihood estimation for this fam- 



e.g 



McCulloch. 


198fJ: 


Nolan. 


2001 



Peters et al 



(2010) proposed the 



). From a Bayesian perspective, 
use of ABC methods to obtain an approximate posterior sample of the parameters. They propose six 
summary statistics that can be used for this purpose. 

Here, we analyse the logarithmic daily returns using the closing price of IBM ordinary stock from 
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Jan. 1 2009 to Jan. 1 2012. Figure [J] shows the corresponding histogram. For this data set, the M LE 



2010h is 



using McCulloch's quantile method implemented in the R package 'fBasics' (IWuertz et al. 
(a, j3, ft, a)= (1.4930, -0.0780, -0.0007, 0.0073). 

Given the symmetry observed and in the spirit of parsimony, we consider the skewness parameter (3 
to be in order to calculate the AMLE of the parameters (a, //, a). Based on the interpretation of these 
parameters (shape, location and scale) and the data we use the priors 

a ~ 17(1, 2), fi ~ U(-0.1, 0.1), a ~ [7(0.0035, 0.0125) 

which due to the scale of the data may appear concentrated but are, in fact, rather uninformative, allowing 
a location parameter essentially anywhere within the convex hull of the data, scale motivated by similar 
considerations and any value of the shape parameter consistent with the problem at han d- 



Peters et al 



(I2010h . which consists 



For the (non-sufficient) summary statistic we use proposal S4 of y 
of the values of the empirical characteristic function evaluated on an appropriate grid. We use the grid 
t € {-250, -200, -100, -50, -10, 10, 50, 100, 200, 250}, an ABC sample of size 2,500, a tolerance 
e G {0.5, 0.4, 0.3, 0.2, 0.125} and Gaussian kernel density estimation. Figure [5] illustrates Monte Carlo 
variability of the AMLE of (a,fi,a). Boxplots were obtained using 50 replicates of the AMLE pro- 
cedure. A simulation study considering several simulated data sets produced with common parameter 
values (results not shown) suggest that the sampling variability in McCulloch's estimator exceeds the 
difference between that estimator and the AMLE based upon S4. In general, considerable care must of 
course be taken in the selection of statistics — it is noteworthy that the quantiles used in McCulloch's 
own estimator satisfy most of the requirements of Proposition although it is not clear that it is possible 
to demonstrate the Lipschitz-continuity of g^ 1 in this case. 
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Figure 4: Logarithmic daily returns using the closing price of IBM ordinary stock Jan. 1 2009 to Jan. 1 2012. 



0.3 0.2 0.125 



0.3 0.2 0.125 



0.3 0.2 0.125 



(a) 



(b) 



(c) 



Figure 5: Monte Carlo variability of the AMLE: (a) a; (b) /i; (c) a. Horizontal lines represent McCulloch's 
estimator produced by the R package 'ffiasics'. 



4.4 Superposed gamma point processes 

The modelling of an unknown number of superposed gamma point process es provides another scenario 



with intractable likelih oods which is currently attracting some attention (iCox and Kartsonaki 



Mengersen et al. 



2012; 



2012). Intractability of the likelihood in this case is a consequence of the dependency 
between the observations, which complicates the construction of th eir joint distribution. Superposed 



Cox and Smith! (11954) present an 



point processes have applications in a variety of areas, for instance 
application of this kind of processes in the context of neurophysiology. In this example we consider a 
simulated sample of size 88 of N = 2 superposed point processes with inter-arrival times identically 



distributed as a gamma random variable with shape parameter a = 9 and rate parameter (3 = 1 observed 
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in the interval (0, tn ) , with to = 420. This choice is inspired by the simulated example presented in 



Cox and Kartsonakil (12012h . 



In order to make inference on the parameters (A, a, (3) using the AMLE approach, we implement 
two ABC samplers using the priors N ~ Unif{l, 2, 3, 4, 5}, a ~ Unif(5, 15), ~ Unif(0.25, 1.5), 
tolerances e € {0.5, .4,0.3,0.2,0.15} an d two sets of summary statistics. The first set of summary 



statistics, proposed in 



Mengersen et al. 



(2012). 



Cox and Kartsonakil (120121) and subsequently used in 
consists of the mean rate of occurrence, the coefficient of variation of the intervals between succes- 
sive points, the sum of the first five autocorrelations of the intervals, the mean of the intervals, and 
the Poisson indexes of dis persion, variance divided by mean, for intervals of length 1, 5, 10 and 20. 



Cox and Kartsonakil (120121) mention that summary statistics based on the intervals between successive 
points are likely to be useful when A is small, therefore we consider a second set of summary statistics 
by adding a ninth quantity based on the third moment: the sample skewn ess of the intervals between suc- 



cessive points Y^™ .(xj — x) 3 /(y^"_ 1 (x 1 - — x) 2 /n) 3 ^ 2 . Note that, unlike 



Mengersen et al. 



Cox and Kartsonakil (120121 ) and 



(120121) . we are taking the discrete nature of the parameter N into account. The AMLE 
approach is still applicable in this context given that the maximisation of the joint posterior distribution 
of (A, a, /3) can be conducted by conditioning on N. We also considered a continuous prior, uniform 
over [1, 5] and obtained comparable results (not shown) - although, naturally, by using a discrete prior 
on A instead of a continuous one, the uncertainty in the estimation of a and j3 is reduced. Although 
allowing A to take a continuous range of values leads to an analysis which is arguably more immedi- 
ately comparable to those presented previously in the literature, we prefer to restrict A to a discrete set 
as this is consistent with the statistical interpretation of the parameter and the possibility of doing so is 
a clear advantage of the AMLE methodology. Figure [6] shows the Monte Carlo variability, estimated by 
using 50 AMLE samples, for each of the two AMLE approaches based on ABC samples of size 5000. 
We can notice that the precision in the estimation of (a, /3) increases faster, as the tolerance decreases, 
when using 9 summary statistics. We can observe the same phenomenon from Table [T]in the estimation 
of N. (Note that the horizontal line shows the parameters used to generate the data not the true value 
of the MLE). The uncertainty in the estimation of a and /3 using the AMLE approach with the se t of 9 



Cox and Kartsonaki 



(2012) for a 



summary statistics seems to be qualitatively comparable with that in 
small tolerance e. Figure|7]shows scatter plots of the AMLE estimators of /3 and a for e = 0.15 and both 
sets of summaiy statistics. This scatterplot demonstrates that the mean (a//3) of the gamma distribution 
is much more tightly constrained by the data than the shape parameter, leading to a nearly-flat ridge in 
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the likelihood surface. The variability in the estimated value of a/ (3 is, in fact, rather small; while the 
variability in estimation of the shape parameter reflects the lack of information about this quantity in the 
data and the consequent flatness of the likelihood surface. 



n 1 bTb 



0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.15 0.15 



mm 



0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.15 0.15 



(a) 



(b) 



Figure 6: Effect of e e {0.5, 0.4, 0.3, 0.2, 0.15} for n = 5000: (a) a; (b) (3. The AMLE samples with 8 and 9 
summary statistics are presented in white and gray boxplots, respectively. The continuous red line represents the 
true value of the parameter. 
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Figure 7: AMLE estimators of /3 vs. AMLE estimators of a: (a) 8 summary statistics; (b) 9 summary statistics. 
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Table 1: Replicate study with a single data realisation. Estimators of N for different values of e 
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To show the variability of the estimator with different data, we also compare the variability of the 
estimators obtained using 50 different data sets. For each data set we obtain the corresponding AMLE 
of (N,a,(3) by using the priors N ~ Unif{l, 2, 3, 4, 5}, a ~ Unif(5, 13) and p ~ Unif(0.5, 1.5), 
tolerances e € {0.5, 0.4, 0.3, 0.2, 0.15} and the two sets of summary statistics mentioned above. Figure 
[8] shows the boxplots of the AMLEs for (a, (3) obtained using ABC samples of size 5000. We can 
observe that the behaviour of the estimators of (a, j3) is fairly similar for both sets of summary statistics. 
Table Q] also suggests an improvement in the estimation of N produced by the inclusion of the sample 
skewness. 




Figure 8: Effect of e e {0.5, 0.4, 0.3, 0.2, 0.15} for n = 5000: (a) a; (b) (3. The AMLE samples with 8 and 9 
summary statistics are presented in white and gray boxplots, respectively. The continuous red line represents the 
true value of the parameter. 
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Table 2: Replicate study with 50 data realisations. Estimators of A*" for different values of e 
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5 Discussion 



This paper presents a simple algorithm for conducting maximum likelihood estimation via simulation 
in settings in which the likelihood cannot (readily) be evaluated and provides theoretical and empirical 
support for that algorithm. This adds another tool to the "approximate computation" toolbox. This 
allows the (approximate) use of the MLE in most settings in which ABC is possible: desirable both 
in itself and because it is unsatisfactory for the approach to inference to be dictated by computational 
considerations. Furthermore, even in settings in which one wishes to adopt a Bayesian approach to 
inference it may be interesting to obtain also a likelihood-based estimate as agreement or disagreement 
between the approaches. Naturally, both ABC and AMLE being based upon the same approximation, 
the difficulties and limitations of ABC are largely inherited by AMLE. Selection of statistics in the case 
in which sufficient statistics are not availa ble remains a critical question . There has been considerable 



work on this topic in recent years (see e.g. 



Fearnhead and Prangle 



20121) . 



A side-effect of the AMLE algorithm is an approximate characterisation of the likelihood surface, 
or in Bayesian settings of the posterior surface. We would strongly recommend that this surface be in- 
spected whenever ABC or related techniques are used as even in settings in which the original likelihood 
contains strong information about the parameters it is possible for a poor choice of summary statistic 
to lead to the loss of this information. Without explicit consideration of the approximation, perhaps 
combined with prior sensitivity analysis, this type of issue is difficult to detect. 
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